import warnings
warnings.simplefilter(action='ignore')
import pandas as pd
import statsmodels.api as sm
import numpy as np
import pyfonts
import scipy as sp
import statsmodels.tsa as tsa
import statsmodels.formula.api as smf
import statsmodels.graphics.tsaplots as tsaplots
from pykalman import KalmanFilter
from mplsetup import *
np.random.seed(15)
intToDay = {0: "Mon", 1: "Tue", 2: "Wed", 3: "Thu", 4: "Fri", 5: "Sat", 6: "Sun"}
pd.set_option('display.max_rows', 200)
What's the Data?
I was a developer a while back on a Realm of the Mad God (RotMG) private server called Runes' Domain. The game is an online multiplayer roleplaying permadeath bullet hell; basically, dodge bullets until you die and lose your character forever, losing all your progress. Might sound terrible but at least you get to die alongside your online friends! Here's me playing on the base game RotMG:

A private server is a fan-made version of the original game; the community around RotMG private servers is small but dedicated. I wrote a headless Python bot that connected to the game and collected data on various statistics, one of them being the number of players online. Data was stored using SQLite every 30 seconds over a two month period from November to January, resulting in over 160,000 observations.
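The collection loop was conceptually simple; here's a minimal sketch of logging a count to SQLite on a timer (the table name, schema, and `get_count` callback are illustrative, not the bot's actual code):

```python
import sqlite3
import time

def log_player_count(con, get_count, n = 3, interval = 0.0):
    # one row per poll: (unix timestamp, players online at that moment)
    con.execute("CREATE TABLE IF NOT EXISTS players (timeScraped REAL, playersOnline INTEGER)")
    for _ in range(n):  # the real bot polled every 30 seconds for two months
        con.execute("INSERT INTO players VALUES (?, ?)", (time.time(), get_count()))
        con.commit()
        time.sleep(interval)

con = sqlite3.connect(":memory:")
log_player_count(con, lambda: 7)
print(con.execute("SELECT COUNT(*) FROM players").fetchone()[0])  # 3
```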
x = pd.read_csv("../data/playersOnline.csv")
x.shape
(167908, 2)
Game player data is particularly interesting to analyze for a few reasons. I hypothesize there will be multiple seasonal effects: one for the daily fluctuation in players and a weekly effect due to the weekend. We'll look at player behavior, such as the most popular times players are online, which tells us when to time new content releases or events for maximum participation. Lastly, statistical time series models can help predict how many players will be online in the future.
Let's perform a comprehensive statistical analysis on this time series data! First, we'll define some research questions that would be insightful to pursue, especially in the context of an online game.
Research Questions
- How will we deal with missing data, especially in the context of the game?
- Can we identify any general trends or cyclic player behaviors? If so, explain the trends and seasonalities with respect to events in the game.
- Player analysis:
- What are the peak player times?
- On average, is there a day of week where players are more likely to be online?
- Can we determine a statistically sound model for forecasting? If so, how far in the future can we forecast? Which models perform the best?
To answer these questions, we will:
- Clean the data and handle data imputation carefully
- Perform exploratory analysis by analyzing player behavior, looking at trends + daily/weekly cycles, and inspecting the autocorrelation structure
- Compare and contrast forecasting models based on the autocorrelation structure of the data and data quality
Let's begin!
Cleaning the Data
When collecting the player count data, the bot sent a chat request asking for the players online and, upon the reply message, recorded the reported count minus the collecting account's own contribution. Because there was a slight delay between request and response, whenever zero players were online and I loaded in one of my test accounts during that window, the playercount would be erroneously recorded as -1. This happened a few times:
x.timeScraped = pd.to_datetime(x.timeScraped, unit = "s")
x.playersOnline = x.playersOnline.astype("Float64")
x["year"] = x.timeScraped.dt.year
x["month"] = x.timeScraped.dt.month
x["day"] = x.timeScraped.dt.day
x["hour"] = x.timeScraped.dt.hour
x["minute"] = x.timeScraped.dt.minute
x["second"] = x.timeScraped.dt.second
x[x["playersOnline"] < 0]
| timeScraped | playersOnline | year | month | day | hour | minute | second | |
|---|---|---|---|---|---|---|---|---|
| 58657 | 2021-12-08 23:39:09.988447428 | -1.0 | 2021 | 12 | 8 | 23 | 39 | 9 |
| 61527 | 2021-12-09 23:46:00.350245953 | -1.0 | 2021 | 12 | 9 | 23 | 46 | 0 |
| 62007 | 2021-12-10 03:46:53.820574760 | -1.0 | 2021 | 12 | 10 | 3 | 46 | 53 |
| 64879 | 2021-12-11 03:52:46.326685905 | -1.0 | 2021 | 12 | 11 | 3 | 52 | 46 |
| 66980 | 2021-12-11 21:34:48.009655714 | -1.0 | 2021 | 12 | 11 | 21 | 34 | 48 |
| 75693 | 2021-12-15 23:10:54.268729925 | -1.0 | 2021 | 12 | 15 | 23 | 10 | 54 |
| 103858 | 2021-12-25 22:02:42.676372051 | -1.0 | 2021 | 12 | 25 | 22 | 2 | 42 |
| 103860 | 2021-12-25 22:03:35.176578283 | -1.0 | 2021 | 12 | 25 | 22 | 3 | 35 |
We'll clip these values to zero, which better represents the actual players online. Let's also downsample the data to the average player count per hour to make forecasting more tractable. We perform a group by here; a weekend indicator column will be added a bit later.
x.loc[x["playersOnline"] < 0, "playersOnline"] = 0
x1 = x.groupby(by= ["year", "month", "day", "hour"]).agg({"playersOnline" : ["mean"]})
x1 = x1.reset_index(col_level = 1)
x1.columns = x1.columns.droplevel(0)
After aggregation we are left with 1,449 total hours. Every row represents one hour of a day:
print(x1.shape[0])
x1.head(5)
1449
| year | month | day | hour | mean | |
|---|---|---|---|---|---|
| 0 | 2021 | 11 | 16 | 0 | 7.135135 |
| 1 | 2021 | 11 | 16 | 1 | 7.191667 |
| 2 | 2021 | 11 | 16 | 2 | 5.766667 |
| 3 | 2021 | 11 | 16 | 3 | 3.883333 |
| 4 | 2021 | 11 | 16 | 4 | 2.641026 |
Handling Missing Rows
The example below shows that on January 4, 2022 only 10 hourly datapoints were collected; we are missing data from 8 AM to 8 PM as well as the 11 PM hour. Before proceeding to analysis we'll need to deal with missing rows, as most time series statistical models require evenly spaced observations.
x1[(x1["year"] == 2022) & (x1["month"] == 1) & (x1["day"] == 4)]
| year | month | day | hour | mean | |
|---|---|---|---|---|---|
| 1027 | 2022 | 1 | 4 | 0 | 26.091667 |
| 1028 | 2022 | 1 | 4 | 1 | 37.441667 |
| 1029 | 2022 | 1 | 4 | 2 | 29.858333 |
| 1030 | 2022 | 1 | 4 | 3 | 25.949153 |
| 1031 | 2022 | 1 | 4 | 4 | 18.741667 |
| 1032 | 2022 | 1 | 4 | 5 | 13.366667 |
| 1033 | 2022 | 1 | 4 | 6 | 12.641026 |
| 1034 | 2022 | 1 | 4 | 7 | 11.298246 |
| 1035 | 2022 | 1 | 4 | 21 | 25.636364 |
| 1036 | 2022 | 1 | 4 | 22 | 27.75 |
The missing hours are simply absent from the dataframe. How many hours are we missing in total? We need to check which days are missing hours and add rows for them. One way to accomplish this is by counting the occurrences of each hour in a day; if a day has fewer than 24, it has missing values.
md = x1[["year", "month", "day"]].value_counts().reset_index().rename(columns={"count": "# hours present"})
md = md[md["# hours present"] < 24].sort_values(by = "# hours present")
s = 24 * md.shape[0] - md["# hours present"].sum()
print("Number of hours missing: {}".format(s))
md.head(5)
Number of hours missing: 231
| year | month | day | # hours present | |
|---|---|---|---|---|
| 69 | 2022 | 1 | 2 | 1 |
| 68 | 2021 | 12 | 28 | 1 |
| 67 | 2022 | 1 | 3 | 3 |
| 66 | 2021 | 12 | 14 | 4 |
| 65 | 2022 | 1 | 22 | 6 |
To add rows corresponding to the hours missing, we will iterate over each day and introduce new rows. We'll sort the rows by date so we can graph the series in order.
addNA = {col : [] for col in x1.columns}
for _, row in md.iterrows():
    allHours = set(range(24))
    check = x1.loc[(x1["year"] == row["year"]) & (x1["month"] == row["month"]) & (x1["day"] == row["day"])]
    missingHours = allHours.difference(set(check["hour"]))
    for h in missingHours:
        for key in addNA:
            if key in ["year", "month", "day"]:
                addNA[key].append(check[key].iloc[0])
            elif key == "hour":
                addNA[key].append(h)
            else:
                addNA[key].append(pd.NA)
x2 = pd.concat([x1, pd.DataFrame(addNA)], ignore_index=True)
x2 = x2.sort_values(by=["year", "month", "day", "hour"]).reset_index(drop = True)
A sanity check for adding the right amount of rows:
assert x2.shape[0] - 231 == x1.shape[0]
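As an aside, pandas can find every gap in one shot if we build a complete hourly DatetimeIndex and reindex against it; a minimal sketch on a toy series:

```python
import pandas as pd

# toy series: three days of hourly data with four hours knocked out
idx = pd.date_range("2021-11-16", "2021-11-18 23:00", freq = "h")
s = pd.Series(range(len(idx)), index = idx).drop(idx[5:9])
full = s.reindex(idx)  # reintroduces the missing hours as NaN
print(full.isna().sum())  # 4
```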
Add a date column for graphing and verify everything is in order:
x2["date"] = x2.year.astype(str) + "-" + x2.month.astype(str) + "-" + x2.day.astype(str).str.zfill(2)
x2["date"] = pd.to_datetime(x2["date"])
x2["dayOfWeek"] = x2["date"].dt.dayofweek
x2["weekend"] = x2["dayOfWeek"] > 4
x2["date"] = x2.month.astype(str) + "-" + x2.day.astype(str).str.zfill(2)
print(x2.head(5), "\n\n", x2.tail(5))
year month day hour mean date dayOfWeek weekend
0 2021 11 16 0 7.135135 11-16 1 False
1 2021 11 16 1 7.191667 11-16 1 False
2 2021 11 16 2 5.766667 11-16 1 False
3 2021 11 16 3 3.883333 11-16 1 False
4 2021 11 16 4 2.641026 11-16 1 False
year month day hour mean date dayOfWeek weekend
1675 2022 1 26 19 0.0 1-26 2 False
1676 2022 1 26 20 <NA> 1-26 2 False
1677 2022 1 26 21 <NA> 1-26 2 False
1678 2022 1 26 22 <NA> 1-26 2 False
1679 2022 1 26 23 <NA> 1-26 2 False
Seems like we are good to go to start our analysis! Or are we...
Always Look at the Data!
After the previous manipulations, each row should be a single hour between November 16th and January 26th, sorted by ascending date. However, our previous algorithm relied on each day having at least one recorded hour; otherwise we miss entire days. A quick look at the total hours per day of week shows something suspicious:
x2.dayOfWeek.value_counts()
dayOfWeek
1    264
2    264
4    240
5    240
6    240
3    216
0    216
Name: count, dtype: int64
Monday and Thursday appear for 216 hours (9 days) compared to 264 hours (11 days) for the most frequent days. If our data were truly continuous in time, that's impossible: over any continuous span of dates, each day of the week can appear at most one time fewer than any other, not two.
Because our data is sorted ascending in time, we can check for missing days by subtracting a lagged copy of the day column from itself and taking the absolute value. Any difference of 2 or greater flags a potentially missing day (a month rollover can also produce one).
potentialMissing = x2[(x2["day"] - x2["day"].shift(1)).abs() >= 2]
for i in potentialMissing.index:
    print(x2[i-1 : i+1], "\n")
year month day hour mean date dayOfWeek weekend
359 2021 11 30 23 17.475 11-30 1 False
360 2021 12 1 0 12.091667 12-01 2 False
year month day hour mean date dayOfWeek weekend
383 2021 12 1 23 <NA> 12-01 2 False
384 2021 12 3 0 <NA> 12-03 4 False
year month day hour mean date dayOfWeek weekend
1079 2021 12 31 23 29.366667 12-31 4 False
1080 2022 1 1 0 26.586207 1-01 5 True
year month day hour mean date dayOfWeek weekend
1295 2022 1 9 23 <NA> 1-09 6 True
1296 2022 1 11 0 <NA> 1-11 1 False
Our algorithm missed two days: December 2nd and January 10th. Let's add them in and resort by time:
for y, m, d, dow in [[2021, 12, 2, 3], [2022, 1, 10, 0]]:
    o = pd.DataFrame(
        {"year": [y] * 24, "month": [m] * 24, "day": [d] * 24,
         "hour": list(range(24)), "mean": [pd.NA] * 24, "date": f"{m}-{d:0>2}",
         "dayOfWeek": [dow] * 24, "weekend": [False] * 24
        })
    x2 = pd.concat([x2, o], ignore_index=True)
x2 = x2.sort_values(by=["year", "month", "day", "hour"]).reset_index(drop = True)
print("# of hours missing: {}".format(x2["mean"].isna().sum()))
print("% of hours missing: {:.1%}".format((x2.shape[0] - x1.shape[0]) / x2.shape[0]))
# of hours missing: 279
% of hours missing: 16.1%
We are missing 16.1% of the hours over the span we collected data! This is a significant proportion of missing data, so we'll need to be careful with imputation, with deciding whether to forecast at all, and (as we just saw) even during data cleaning :)
Now we are ready to explore the data!
Comprehensive Exploratory Analysis
We'll now perform an exploratory deep dive into the behavior of our players. First, let's look at some graphs to get a general sense of the data. Always an important step before diving into any type of analysis!
Visualization of Data
Our exploratory analysis begins by looking at the entire time series:
def graph(index, data, label, alpha, figsize, call):
    fig, ax = plt.subplots(figsize = figsize)
    call(index, data, ax, label, alpha)
    ax.set_xlabel("Time")
    ax.set_ylabel("Average players")

def firstGraph(index, data, ax, label, alpha):
    ax.plot(index, data["mean"], label = label, alpha = alpha, color = lavender)
    ax.set_title(label, fontsize = "x-large")
    ax.set_xticks(index[::96], data["date"][::96])

graph(x2.index, x2, "Average Players in Runes' Domain", 1, (15, 5), firstGraph)
Few observations:
- There is a clear daily seasonal effect corresponding to peak times when most of the playerbase is on. Since our data is hourly, this corresponds with a seasonal period of $m = 24$.
- A trend certainly exists, especially beginning around December 19th when the number of players steadily increased. This implies nonstationarity, which we'll have to deal with when using ARIMA for forecasting.
- In January, the player count slowly decreased until it hit zero at the end. This is due to a developer on the team leaking the database with login credentials to the public, forcing us to shut down the game.
- There are a significant number of missing values; in particular, almost two full days at the start of January are missing. I opted not to remove the missing rows: our data shows a strong daily seasonal component, and deleting rows would destroy the even time spacing, causing the seasonal correlation structure to be wrongly estimated.
- As this game was a private server, the number of players is low for an online game. However, we had a dedicated player base who would play very often and for multiple days in a row. The average player count over these two months was 13.2.
x2["mean"].mean()
13.246706063012407
Let's now plot the data to observe potential weekly seasonality effects. For this we'll need to slice the data into weeks and plot each week. We'll also plot the average over each week to get a general sense of any potential weekly seasonality.
# get indices to slice dataframe into weeks (a Sunday-to-Monday boundary
# shows up as dayOfWeek dropping from 6 to 0)
indices = [0] + x2[(x2["dayOfWeek"] - x2["dayOfWeek"].shift(1)) == -6].index.to_list()
# create list of dataframes, one per week
dfs = []
for i in range(len(indices) - 1):
    dfs.append(x2[indices[i] : indices[i+1]].reset_index(drop = True))
dfs.append(x2[indices[-1]:].reset_index(drop = True))
dfs[0].index = dfs[0].index + 24  # data starts on a Tuesday, so shift the first partial week
# average each hour-of-week across weeks, skipping missing values
meanWeekly = []
for i in range(24 * 7):
    numObs = 0
    runningTotal = 0
    for df in dfs:
        try:
            if df.iloc[i].name == i and pd.notnull(df.iloc[i]["mean"]):
                numObs += 1
                runningTotal += df.iloc[i]["mean"]
        except IndexError:
            pass
    meanWeekly.append(runningTotal / numObs)
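As an aside, the same hour-of-week profile can be computed more directly by grouping on a combined key; a minimal sketch on a toy stand-in for x2 (values are made up):

```python
import pandas as pd

# toy frame standing in for x2: two Mondays and one Tuesday, hours 0-1
df = pd.DataFrame({
    "dayOfWeek": [0, 0, 1, 1, 0, 0],
    "hour":      [0, 1, 0, 1, 0, 1],
    "mean":      [10.0, 12.0, 8.0, 9.0, 14.0, None],
})
hourOfWeek = df["dayOfWeek"] * 24 + df["hour"]
profile = df.groupby(hourOfWeek)["mean"].mean()  # NaNs are skipped by default
print(profile.loc[0])  # (10.0 + 14.0) / 2 = 12.0
```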
def plotWeeklySeasonal(dfs : list, means : list, label, alpha, figsize):
    fig, ax = plt.subplots(figsize = figsize)
    for w in range(len(dfs)):
        ax.plot(dfs[w].index, dfs[w]["mean"], alpha = 0.1, color = lavender)
    ax.plot(range(7 * 24), means, label = label, alpha = alpha, color = lavender)
    ax.set_xticks([i * 24 for i in range(7)], ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"])
    ax.set_xlabel("Day of week")
    ax.set_ylabel("Average players")
    ax.set_title("Average players by day of week", fontsize = "x-large")
    ax.legend()

plotWeeklySeasonal(dfs, meanWeekly, "mean", 1, (12, 5))
There seems to be an extremely slight weekend effect: Saturday and especially Sunday see the largest average peak players. Moreover, Monday sees a strong drop compared to Sunday and appears to have the lowest average players of any day, lending credence to the hypothesis that our game peaks on Sunday and then dips into the work week. On the other hand, the Tuesday evening peak rivals Saturday's, so the weekly effect might just be noise. Looking closely, the average player count slightly increases from Monday to Wednesday, decreases until Saturday, then rises again on Sunday.
Later we will take a look at some decomposition models which may be able to detect both the daily and weekly cycles automatically. For now let's dive into general player analysis on the daily and weekly behaviors.
Seasonality: Player Analysis
Before exploring general time series models, let's analyze the daily and weekly behaviors in detail, particularly with respect to seasonality.
Daily: Most/Least Active Times
The previous graphs show a strong daily cycle. What are the most and least popular times on the game? We group by hour and look at the mean:
def byHourGraph(index, data, label, alpha, figsize):
    fig, ax = plt.subplots(figsize = figsize)
    ax.plot(index, data["mean"], alpha = alpha, color = lavender)
    ax.set_xticks(index[::4], index[::4])
    ax.set_xlabel("Hour")
    ax.set_ylabel("Average Players")
    ax.set_title(label, fontsize = "x-large")

avgPlayersByHour = x2.groupby("hour").agg({"mean" : "mean"})
byHourGraph(avgPlayersByHour.index, avgPlayersByHour, "Average players by Hour", 1, (12, 5))
Since hours range from 0 to 23 in the dataset, 8 AM PST is the least popular time in our game, with an average of about 6 players online. This makes sense as most of our players are based in the US. The core playerbase slowly trickles in after that, peaking around 20 players at 8 PM, after which activity slowly dies down. Typically we would run a statistical test to determine whether the differences between hours are significant, but we'll hold off until the next analysis.
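The exact peak and trough hours can also be read off programmatically; a sketch with toy hourly averages (the real values come from the groupby above):

```python
import pandas as pd

# toy hourly averages, indexed by hour of day (illustrative numbers)
avgByHour = pd.Series({6: 7.0, 8: 6.0, 14: 11.5, 20: 20.2, 23: 15.0})
print(avgByHour.idxmin(), avgByHour.idxmax())  # 8 20
```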
Weekly: More Popular on Weekends?
Next we'll examine whether player count differs statistically significantly between the days of the week. The boxplots of player counts by day of week and their CDFs are shown below:
fig, (ax1, ax2) = plt.subplots(1, 2, gridspec_kw = {"width_ratios": [2, 1]}, figsize = (15, 5))
weeks = []
for i in range(7):
    weeks.append(x2[x2["dayOfWeek"] == i]["mean"].dropna())
    ax2.ecdf(weeks[-1], label=intToDay[i], color = lightBlue)
ax2.legend(labelcolor = textColor, facecolor = graphColor, edgecolor = borderColor)
ax2.set_ylabel("Total proportion of players")
ax2.set_xlabel("Players")
ax1.boxplot(weeks);
ax1.set_ylabel("Players")
ax1.set_xlabel("Day of week")
ax1.set_xticks([i for i in range(1, 8)], ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"])
pd.DataFrame(
[[intToDay[i], weeks[i].mean(), weeks[i].var(), weeks[i].shape[0]] for i in range(len(weeks))],
columns = ["Day", "Mean", "Variance", "# samples"])
| Day | Mean | Variance | # samples | |
|---|---|---|---|---|
| 0 | Mon | 12.500206 | 66.520418 | 178 |
| 1 | Tue | 13.451834 | 70.481632 | 198 |
| 2 | Wed | 13.261613 | 74.636376 | 251 |
| 3 | Thu | 12.659339 | 72.561600 | 209 |
| 4 | Fri | 13.555986 | 95.496532 | 221 |
| 5 | Sat | 12.723414 | 59.845036 | 209 |
| 6 | Sun | 14.625377 | 101.960130 | 183 |
At first glance, it seems like there won't be any significant differences in average playercount across the week: there isn't much variability between the days. To choose a statistical test, let's take an inventory of observations:
- Each distribution is strongly skewed to the right, violating normality.
- The distributions of each day are visually similar except for Friday and Sunday, which have more outliers than any other day. The CDFs show this as well; Friday and especially Sunday have slightly different distributions of values around 10 players and beyond 25 players.
- Independence is strongly violated because whether a player logs on a certain day can depend on whether they logged in on previous days; some people may have tired of the game and quit. In particular, there is strong correlation between the hours of the day: if the average player count is rising hour over hour before 8 PM, it tends to keep rising until 8 PM.
- The variances range from 59.85 on Saturday to 101.96 on Sunday, a ratio under 2x. Fine for ANOVA.
- The number of samples within each day is large; they range from 178 on Monday to 251 on Wednesday.
- Our data doesn't capture our ideal statistical unit: ideally, each row in the dataset would be one player and whether or not they were logged in. Unfortunately, our collected data aggregates all individuals who were online.
These assumptions make it clear that a one-way ANOVA is the wrong approach, especially due to the strong correlation structure between the hours. Moreover, the previous sections show the variability between the hours is much greater than the variability between the days, so we'd need to account for that.
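For completeness: a rank-based alternative like the Kruskal-Wallis test would handle the skew, but it still assumes independent observations, so it inherits the same core problem; a sketch on toy skewed samples (distribution parameters are made up):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(15)
# seven right-skewed "days" of player counts (toy data)
days = [rng.gamma(2.0, 6.5, size = 200) for _ in range(7)]
stat, p = stats.kruskal(*days)
# a p-value is produced, but the independence assumption fails on our data
print(round(p, 3))
```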
Linear Mixed Model
One way to think about our data is with a nested structure. Our data is really better modeled with time series correlation structures, but for the sake of argument we'll continue comparing between days of the week. Take the following results with a grain of salt (maybe even half a grain).
We can think about our data as structured in a nested format:
circle1 = plt.Circle((0.5, 0.5), 0.4, color = "#363636")
circle2 = plt.Circle((0.5, 0.4), 0.3, color = "#4D4D4D")
circle3 = plt.Circle((0.5, 0.3), 0.2, color = "#6B6B6B")
fig, ax = plt.subplots(figsize = (5, 5))
ax.spines[["left", "right", "top", "bottom"]].set_visible(False)
ax.set_xticks([])
ax.set_yticks([])
ax.text(0.5, 0.75, "Day of Week", horizontalalignment = "center", fontsize = "x-large");
ax.text(0.5, 0.55, "Hour", horizontalalignment = "center", fontsize = "x-large");
ax.text(0.5, 0.25, "player\ncount", horizontalalignment = "center", fontsize = "x-large");
ax.add_patch(circle1)
ax.add_patch(circle2)
ax.add_patch(circle3);
Each average player count is associated with an hour of day, and each hour is nested within each day. The grouping of hour will explain significant variation in player count, as the hourly average graph showed.
Therefore, we'll attempt to analyze day-of-week differences with a linear mixed model that captures the variance and correlation structure within hours. This gives the within-hour variance a proper home, and after accounting for it, we can ask whether there really is a difference between the days of the week.

Keep in mind that treating hour as a grouping variable only models correlation within one hour-group, not between different hours. If the player count is high at 8 PM one week, the model expects it to be high at 8 PM in other weeks; but it cannot capture the fact that the count at 7 PM should be similar to the count at 8 PM. This is a fundamental limitation of using a linear mixed model on time series data: the random effect structure only allows correlation within a single group of a random effect, not between groups. As a result, we expect this model to be misspecified for our data, and any significant differences between days of the week should not be trusted.
Our basic model is as follows. Let $Y_{ij}$ be the number of players observed on day $i$ and hour $j$. Then $$Y_{ij} = \mu + \beta_1D_1 + \beta_2D_2 + \beta_3D_3 + \beta_4D_4 + \beta_5D_5 + \beta_6D_6 + H_{j} + \varepsilon_{ij}$$
where
- $\mu$ is the average playercount across all hours for Monday (the reference level)
- $\beta_i$ is the day-of-week fixed effect: how different the average player count is on day $i$ compared to Monday
- $D_i = 1$ if the player count observation is on day $i$, else 0
- $H_{j}$ is the random effect for hour $j$, with $H_j \sim N(0, \sigma_0^2)$ independent across hours
- Note that the only parameter estimated for the random effects is $\sigma_0^2$; there is no covariance matrix across the hour effects, meaning no two hours are modeled as related to each other
- $\varepsilon_{ij} \sim N(0, \sigma^2)$ are independent errors (not true in our data!)
We chose hour to be a random effect in order to model the dependence within hours; it allows each hour to have its own average offset $H_j$ from $\mu$. We chose day of week to be a fixed effect because this lets us compare the "pure" effect of the day of week while holding hour of day fixed, i.e. comparing across the same hours of the day! If this effect is significant, then there does exist a statistically significant difference in player count between the days of the week at the same hours.
We'll be using mixedlm from the statsmodels package. A note before we examine the output: since we are treating dayOfWeek as a fixed effect, we need to tell statsmodels that dayOfWeek is a categorical variable using C(dayOfWeek). Otherwise the model treats dayOfWeek as continuous and returns a single coefficient $\beta$ interpreted as "for each one-unit increase in the day of week, player count changes by $\beta$, holding hour fixed." That's not what we want; we're asking whether the days differ, not how much the count increases from Monday to Sunday.
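The difference is easy to see by inspecting the design matrices; a minimal sketch using the patsy package (which statsmodels' formula API builds on), with a toy frame:

```python
import pandas as pd
from patsy import dmatrix

df = pd.DataFrame({"dayOfWeek": [0, 1, 2, 3]})
# treated as continuous: intercept + one slope column
cont = dmatrix("dayOfWeek", df)
# treated as categorical: intercept + one dummy per non-reference level
cat = dmatrix("C(dayOfWeek)", df)
print(cont.shape[1], cat.shape[1])  # 2 4
```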
model = smf.mixedlm("mean ~ C(dayOfWeek)", data = x2, groups = "hour", missing = "drop")
res = model.fit(reml = False)
print(res.summary())
Mixed Linear Model Regression Results
============================================================
Model: MixedLM Dependent Variable: mean
No. Observations: 1449 Method: ML
No. Groups: 24 Scale: 58.7650
Min. group size: 55 Log-Likelihood: -5043.0895
Max. group size: 65 Converged: Yes
Mean group size: 60.4
------------------------------------------------------------
Coef. Std.Err. z P>|z| [0.025 0.975]
------------------------------------------------------------
Intercept 12.740 1.044 12.205 0.000 10.694 14.786
C(dayOfWeek)[T.1] 0.735 0.792 0.928 0.354 -0.818 2.287
C(dayOfWeek)[T.2] 0.756 0.751 1.006 0.314 -0.717 2.229
C(dayOfWeek)[T.3] 0.054 0.782 0.069 0.945 -1.479 1.587
C(dayOfWeek)[T.4] 1.027 0.772 1.330 0.183 -0.486 2.541
C(dayOfWeek)[T.5] 0.238 0.782 0.304 0.761 -1.295 1.770
C(dayOfWeek)[T.6] 2.130 0.807 2.639 0.008 0.548 3.712
hour Var 18.219 0.729
============================================================
Recall that $1$ is Tuesday, $2$ is Wednesday, and so on. From the output, we can deduce:
- The average playercount across all hours on Monday is $\mu = 12.74$.
- The variance explained by hour is $\sigma^2_0 = 18.219$ and the unexplained (residual) variance is $\sigma^2 = 58.765$, the Scale parameter in the output.
  - In particular, $\sigma^2_0$ being far from $0$ tells us player counts vary substantially by hour, as we saw before. Including hour as a random effect therefore makes sense; we found a home for the by-hour variance and are comparing days of the week at the same hour. We can describe how clustered the player counts within each hour are using the intraclass correlation, the proportion of the total variance attributable to $\sigma^2_0$: $$\dfrac{\sigma^2_0}{\sigma^2_0 + \sigma^2} = \dfrac{18.219}{18.219 + 58.765} = 0.2367$$ Player counts within the same hour show a moderate correlation; in other words, player counts within an hour are similar to each other. Not surprising, as we saw average player counts differ between hours.
- The trends we noticed in the weekly breakdown are present in our results!
  - All of the coefficients of dayOfWeek are positive, meaning that on average every day sees more players than Monday during the same hours. However, the actual real-world differences are small; around 2 players or fewer.
  - Only Sunday, with an average increase of $2.13$ players over Monday during the same hours, is statistically significant, with a p-value of 0.008 at $\alpha = 0.05$ using the Wald test.
  - The slight increase in players on Tuesday and Wednesday in the earlier graph is reflected in the coefficients 0.735 and 0.756.
  - Interestingly, Friday has the second largest increase in players while Saturday is barely discernible from Monday. I expected Saturday to have more players on average, but Friday being second makes sense; people just finished the week and are looking to relax.

These results give evidence that there is a statistically significant difference in player count between the days of the week during the same hours. In particular, we found a significant Sunday bump in player count, followed by a dip once people are busy with the week.
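For the record, the intraclass correlation arithmetic can be checked in a couple of lines (a fitted statsmodels MixedLM result also exposes these quantities as cov_re and scale); plugging in the numbers from the summary above:

```python
sigma0_sq = 18.219  # hour random-effect variance ("hour Var" in the summary)
sigma_sq = 58.765   # residual variance (the "Scale" line)
icc = sigma0_sq / (sigma0_sq + sigma_sq)
print(round(icc, 4))  # 0.2367
```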
Our results only display pairwise comparisons of Monday vs. the other days. How can we tell whether the variable dayOfWeek as a whole is statistically significant?
Significance Test of Week Fixed Effect: Likelihood Ratio Test
Even though our p-value for Monday vs. Sunday is small, the Wald test's p-value assumes the model behaves the same with and without the day-of-week fixed effect [3]. That may not hold if there is correlation between day of week and hour of day; in that case the p-value will differ from that of a likelihood ratio test (LRT), which explicitly sets the fixed effect to zero and compares the two fits. For this we'll need to compare our model against a simpler model without the day-of-week fixed effect.
Let's do this now. Our null hypothesis is \begin{align*}H_0 &: \text{model with/without fixed week effect fits the data equally well} \\ H_a &: \text{model with fixed week fits the data better}\end{align*} We'll need to construct a reduced (generally called nested) model without the week fixed effect, then calculate $$A = -2 \left(\text{log likelihood of reduced model} - \text{log likelihood of full model}\right)$$
According to Wilks' theorem [4], under the null $A \sim \chi^2_6$, with six degrees of freedom from the six dummy parameters $\beta_1, \dots, \beta_6$.
modelReduced = smf.mixedlm("mean ~ 1", data = x2, groups = "hour", missing = "drop")
resReduced = modelReduced.fit(reml = False)
A = -2 * (resReduced.llf - res.llf)
sp.stats.chi2.sf(A, 6)
0.10607027569108776
With a p-value of $0.106$, at $\alpha = 0.05$ we do not find statistically significant evidence that the model with the day-of-week fixed effect fits our data better. In other words, there probably is no significant difference in player counts across days at the same hours, especially given the violated model assumptions. The difference between Sunday and Monday player counts may be due to random chance!
Note: in fitting the mixed model with .fit(), we didn't use restricted maximum likelihood (reml = False). This is because two linear mixed models with different fixed effect structures estimated under REML have different random effect interpretations [5]. As a result, the reduced model is no longer nested under the null and the LRT doesn't make sense.
Checking Linear Mixed Model Assumptions
Earlier we said our model fails to capture the correlation between hours. To check this, let's examine the i.i.d. assumption on $\varepsilon$. If the model were correctly specified, then $\varepsilon \sim N(0, 58.765)$. Below we plot the residuals over time alongside random draws from $N(0, 58.765)$:
def lmmAssumptions(ax, resid = True):
    from matplotlib.collections import PolyCollection, LineCollection
    ax.set_ylim(-40, 50)
    if resid:
        ax.plot(x2.dropna().index, res.resid, color = lavender)
        ax.set_title("Residuals over time", fontsize = "large")
    else:
        ax.plot(x2.index, np.random.normal(0, np.sqrt(58.765), x2.index.max() + 1), color = lavender)
        ax.set_title("Normally distributed residuals", fontsize = "large")
    for item in ax.collections:
        if isinstance(item, PolyCollection): item.set_facecolor(lavender)
        if isinstance(item, LineCollection): item.set_color(lavender)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 5))
fig.subplots_adjust(wspace=0.1)
lmmAssumptions(ax1)
lmmAssumptions(ax2, False)
Clearly, normality of residuals is violated; our residuals look nothing like the normally distributed errors. In particular, the residuals still show strong correlation over time, called autocorrelation. The random effect of hour only captured correlation within hours, not autocorrelation!

What effect does this have on our conclusions? In a linear regression model, if there is autocorrelation among residuals but no correlation between explanatory variables and residuals (called strict exogeneity), the slope estimates $\beta_i$ are unbiased [6 pp. 352]. However, the standard errors and $t$-values will be too low or too high depending on the type of autocorrelation [6 pp. 413-414]. In other words, the slope estimates should be accurate, but their uncertainty is incorrectly quantified. If these results carry over to the linear mixed model, our statistically significant increase of 2.13 players on Sunday vs. Monday may no longer be significant once the standard error is properly inflated, shrinking the $t$ statistic. The same caveat applies to the LRT in a mixed model; it assumes independent observations after accounting for within-hour correlation.

To rectify the situation, we could use Generalized Least Squares to deal with correlation across hours. However, we won't: later we will see the autocorrelation structure is quite complex due to the combination of daily and weekly effects, and player count likely doesn't differ much across days anyway, so continuing this line of analysis isn't very fruitful.
Conclusion: There isn't enough evidence to conclude there really is a difference in player count across all weeks. We'll now focus on time series techniques to model the autocorrelation between hours in a more appropriate way!
Handling Data Imputation
Our first graph showed general trends and daily cycles. These can be extracted using time series decomposition models, which usually don't allow missing values. Since we have many missing values, we'll start with data imputation.
Before beginning, it's a good idea to define why we are imputing in the first place. With 16.1% of the data missing, forecasting models that require every value to be present can produce serious errors. Our goal for imputing is therefore only to enable decomposition models to uncover the trend and seasonalities. We will not use imputed data for forecasting; instead, we will choose models that can forecast despite missing data.
Handling missing values in a time series requires care; the values could be missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR). MCAR is the ideal situation: the probability of a value being missing is independent of both the observed values and the missing values themselves. MAR is when the probability of missingness depends on observed values but not the missing values, and MNAR is when the probability does depend on the missing values. Each type of missingness affects which imputation techniques are appropriate.
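To make the three mechanisms concrete, here's a sketch on simulated data (the series, thresholds, and drop probabilities are invented for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
players = pd.Series(rng.poisson(20, size=1000).astype(float))
hour = pd.Series(np.arange(1000) % 24)

# MCAR: every value has the same 10% chance of being dropped.
mcar = players.mask(rng.random(1000) < 0.10)

# MAR: missingness depends on an observed variable (hour of day),
# not on the value itself -- evening hours drop out more often.
mar = players.mask((hour >= 18) & (rng.random(1000) < 0.30))

# MNAR: missingness depends on the unobserved value itself --
# high player counts are more likely to go missing.
mnar = players.mask((players > 25) & (rng.random(1000) < 0.50))

print(mcar.isna().mean(), mar.isna().mean(), mnar.isna().mean())
```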
Earlier we saw 16.1% of hours missing. These missing values are likely MCAR; the missingness is independent of the observed playercounts because it resulted from bugs in my data collection program. It was not the result of server downtime, which would affect the observed playercount. I proxied my bot's connection through free proxies, and sometimes they would disconnect my bot at random; my code could sometimes reconnect but not always. The missingness is definitely not MNAR; if it were, high playercounts would have caused my bot to disconnect due to factors such as server load. This is not the case, since we observed playercounts as high as 60+. Let's perform a Chi-square test to see whether the missingness depends on observed values (i.e., is MAR). This will determine which imputation techniques are appropriate.
Chi-square Test For Missing Data Not Being MAR
We'll do a sanity check on whether the missingness is plausibly MCAR (dependent on neither missing nor observed values). If missingness is MCAR, the proportion of missing observations should be roughly the same in every hour of the day. Since hour of day is strongly correlated with player count, as we saw from the daily seasonality, missingness that varies by hour would depend on observed values. Let's check.
MCARtest = x2.groupby("hour").agg(
{"mean" : [lambda x: x.isnull().sum(), lambda x: x.notnull().sum()]}
).reset_index(drop = True).rename(columns = {"<lambda_0>": "missing", "<lambda_1>": "present"})
MCARtest.transpose()
| | | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| mean | missing | 13.0 | 11.0 | 10.0 | 10.0 | 8.0 | 8.0 | 7.0 | 8.0 | 8.0 | 9.0 | ... | 13.0 | 12.0 | 14.0 | 14.0 | 14.0 | 15.0 | 17.0 | 17.0 | 16.0 | 17.0 |
| | present | 59.0 | 61.0 | 62.0 | 62.0 | 64.0 | 64.0 | 65.0 | 64.0 | 64.0 | 63.0 | ... | 59.0 | 60.0 | 58.0 | 58.0 | 58.0 | 57.0 | 55.0 | 55.0 | 56.0 | 55.0 |
2 rows × 24 columns
The missing hours seem to be lower in the morning and higher in the evening. This makes some sense because I was more likely to be awake in the morning, hence able to check on my bot if it was disconnected. But is it statistically significant? Our two variables are hour of day vs. missing/not missing so we need to test if they are associated with each other. The question we need to answer is "is the pattern of missing data independent of the hour of day?" so a Chi-square test of independence is appropriate here [1].
Our null hypothesis is $H_0:$ hour of day and missingness are independent and alternative hypothesis is $H_a:$ hour of day and missingness are associated. If we reject the null then that means missingness depends on observed values, meaning the MCAR assumption is rejected.
from scipy.stats import chi2_contingency
test = chi2_contingency(MCARtest["mean"][["missing", "present"]].astype(int).transpose())
test.pvalue
0.35068836193950104
At an $\alpha$ level of $0.05$ and with a $p$-value of $0.351$, we fail to reject the null hypothesis. We conclude there is not sufficient evidence of an association between missingness and hour of day, or equivalently, between missingness and the observed playercount. The missingness mechanism is therefore unlikely to be MAR and is plausibly MCAR. We'll assume MCAR for the rest of the analysis.
Visualizing Missingness
Before imputation let's take a look at the distribution of lengths of missing sequences. These lengths will decide what type of imputation technique is appropriate.
missingIndices = {}
missingIndex = 0
startOfNullSignal = False
numHoursMissing = 0
for i in x2.index:
    if pd.isnull(x2.loc[i, "mean"]):
        if not startOfNullSignal:
            # start of a new missing run
            startOfNullSignal = True
            missingIndex = i
        numHoursMissing += 1
    elif startOfNullSignal:
        # run ended: record its starting index and length
        missingIndices[missingIndex] = numHoursMissing
        numHoursMissing = 0
        startOfNullSignal = False
if startOfNullSignal:
    # record a run that extends to the end of the series
    missingIndices[missingIndex] = numHoursMissing
def plotMissingDist(indices : dict, figsize):
fig, ax = plt.subplots(figsize = figsize)
ax.hist(indices.values(), color = lavender, bins = 100, width = 0.3)
ax.set_xlabel("Length of Missing Sequence")
ax.set_ylabel("Count")
plt.suptitle("Number of Sequences By Missing Length", y = 1.01)
ax.set_title("Total: {}".format(len(indices)), fontsize = "medium")
plotMissingDist(missingIndices,(10, 5))
Less than a quarter of the missing sequences are two hours or less! In contrast, a significant number of sequences are missing 6 hours or more. We'll try two different imputation techniques: a simple naive seasonal + forward fill, and a more complex technique that handles shorter and longer missing sequences differently. At the end we'll compare the efficacy of both.
Imputation: Naive Seasonal + Forward Fill
Because our missing data is assumed to be MCAR, we don't necessarily need to impute by estimating the underlying data distribution. Therefore, we'll stay away from maximum likelihood methods (such as ARIMA) for now and begin with simple imputation techniques. Given the strong daily seasonality we observed, let's start with seasonal naive imputation: filling each missing value with the value from the same hour on the previous day. Recall that we have 279 total missing values:
imputed = x2.copy(deep = True)
imputed["isNA"] = imputed["mean"].isna()
imputed["mean"]= imputed["mean"].fillna(imputed["mean"].shift(24))#.ffill()
imputed["mean"].isna().sum()
84
Naive seasonal imputation filled 195 of the 279 missing values, about 70%. We'll handle the rest with a naive forward fill:
imputed["mean"] = imputed["mean"].ffill()
imputed["mean"].isna().sum()
0
We're left with zero missing values. Let's see how the time series looks now:
graph(imputed.index, imputed, "Average players, naive imputation", 1, (15, 5), firstGraph)
We can clearly see the negative impact of forward filling, especially around January 3rd: one whole day takes on only two distinct values, evidenced by the two horizontal lines. A full day of missing data there inherits high playercount values from before the gap, which is less than ideal. Another spot the seasonal imputation missed was January 23rd; almost a whole day is missing there. To check whether our imputation technique changes the underlying data distribution, we can compare the distributions before and after imputation, which we'll do after the next imputation technique.
Linear Interpolation + Kalman Filter
In the paper Three-step Imputation of Missing Values in Condition Monitoring Datasets, the authors outline a method that first classifies missing sequences into types [2]. For multivariate data there are three types, but for a univariate time series like ours there are only two:
- Isolated Missing Values (IMV): A single or a sequence of missing values where the number of missing values is less than or equal to some threshold $T$
- Continuous Missing Sample (CMS): A continuous stretch of missing data whose length is greater than $T$
We need to define a cutoff $T$ separating an IMV from a CMS. In the context of our game, because the peak and trough playercounts each occur at a single hour (by visual inspection of the time series), we'll define an IMV with $T = 1$, a single missing hour. For a single missing value we'll perform simple linear interpolation: take the two surrounding observed player counts, draw a line through them, and fill the missing value with their average.
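As a quick sanity check of that midpoint behavior on toy values (not our data):

```python
import numpy as np
import pandas as pd

# A single missing hour between counts of 10 and 16 players
# gets filled with their midpoint, 13.
s = pd.Series([10.0, np.nan, 16.0])
filled = s.interpolate(method="linear")
print(filled.tolist())  # [10.0, 13.0, 16.0]
```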
imputedTwo = x2.copy(deep = True)
for index, numMissing in missingIndices.items():
if numMissing == 1:
imputedTwo.iloc[index - 1 : index + 2] = imputedTwo.iloc[index - 1 : index + 2].interpolate(method = "linear")
For the CMS sequences, we'll use the Kalman filter. The filter allows imputation of missing data based on hidden (unobserved) states and our observed data $y_t$. In general, the Kalman filter models our playercount $y_t$ at time $t$ by the following:
\begin{align*} h_0 &\sim \mathcal{N}(\mu_0, \sigma_0) \\ h_{t+1} &= a_th_t + b_t + \varepsilon_{t+1} \\ y_t &= c_th_t + d_t + \varphi_t \\ \varepsilon_t &\sim \mathcal{N}(0, Q) \\ \varphi_t &\sim \mathcal{N}(0, R)\end{align*}
where $h_t$ is the hidden state. This filter will allow player count from previous time points $t-k$ to provide information to our current playercount $y_t$. When using the Kalman filter, the constants $a_t, b_t, c_t, d_t, \mu_0, \sigma_0$ need to be chosen. For simplicity, we'll set the constants $\mu_0 = b_t = d_t = 0$ and $\sigma_0 = c_t = 1$. So our equations become
\begin{align*} h_{t+1} &= a_t h_t + \varepsilon_{t+1} \\ y_t &= h_t + \varphi_{t} \end{align*}
Substituting $h_t$ reveals an equation similar to the AR(1) structure: $$y_t = a_{t-1}h_{t-1} + \varepsilon_{t} + \varphi_{t}$$
except that $h_{t-1}$ is a hidden state, not $y_{t-1}$, and there are two normally distributed error terms. I chose this structure intentionally to see whether the Kalman filter could capture some seasonality effects, as autoregressive models can!
The Kalman filter can be used for imputation because estimation has a filtering phase and a smoothing phase. During filtering, the hidden state $h_t$ is updated with all time points up to $t$, producing $h_{t+1}$. After filtering, the smoothing phase updates all hidden states before time $t$. As a result, the final hidden states contain information from the entire data set. If a player count $y_t$ is missing, only the filtering stage is performed, possibly across multiple missing hours in a row, creating hidden states based on older data alone. At the next occurrence of an observed player count, both filtering and smoothing occur, updating the hidden states that had missing data with this new information. At the end of the process, the expected value of $h_t$ is our best prediction for the missing value at time $t$.
We'll use the pykalman Kalman filter; during estimation we'll only allow $Q$, $R$, and $\sigma_0$ to change as the filter learns the data. We'll then grid search $a_t$ over $[-2, 2]$ to find the value that best fits our data.
from matplotlib.animation import FuncAnimation
fig, ax = plt.subplots(figsize = (15, 5))
ax.plot(imputedTwo["mean"], color = lavender, label = "Actual data");
ax.set_xlabel("Time")
ax.set_ylabel("Average players")
ax.set_xticks(imputedTwo.index[::96], imputedTwo["date"][::96])
measurements = np.ma.masked_array(imputedTwo["mean"], pd.isnull(imputedTwo["mean"]))
kf = KalmanFilter(transition_matrices = -2, transition_covariance = 1, transition_offsets = 0,
initial_state_mean = 0, initial_state_covariance = 1,
observation_matrices = 1, observation_offsets = 0, observation_covariance = 1,
em_vars = ["transition_covariance", "observation_covariance", "initial_state_covariance"])
kf = kf.em(measurements, n_iter = 5)
(smoothed_state_means, smoothed_state_covariances) = kf.smooth(measurements)
iterablePlot, = ax.plot(smoothed_state_means, alpha = 0.5, color = lightBlue, label = "Kalman filter estimation");
ax.set_title("Transition with -2", fontsize = "large")
ax.set_xticks(imputedTwo.index[::96], imputedTwo["date"][::96])
ax.legend()
def func(frame):
kf = KalmanFilter(transition_matrices = frame, transition_covariance = 1, transition_offsets = 0,
initial_state_mean = 0, initial_state_covariance = 1,
observation_matrices = 1, observation_offsets = 0, observation_covariance = 1,
em_vars = ["transition_covariance", "observation_covariance", "initial_state_covariance"])
kf = kf.em(measurements, n_iter = 5)
(smoothed_state_means, smoothed_state_covariances) = kf.smooth(measurements)
iterablePlot.set_ydata(smoothed_state_means)
iterablePlot.set(alpha = 0.5, color = lightBlue)
ax.set_title("Transition with {}".format(round(frame, 2)), fontsize = "large")
return iterablePlot
#ani = FuncAnimation(fig, func, frames = np.arange(-2, 2.05, 0.05), interval = 200, repeat = True, repeat_delay = 200)
#ani.save("kalman.gif");
Here's a .gif of the results:

Visually speaking, a transition value of $a_t \approx$ 0.8-0.9 both fits our data well and models the seasonality effects, especially during January 3rd when we were missing two whole days. The daily cycle of players is reproduced there, so the Kalman filter can model daily cycles! It's definitely not the best estimate, though: January 11th had 35 missing hours, including 8 PM, and the Kalman filter completely missed that behavior.
Typically, when imputing data you should fit the model on a training set, then validate it against held-out existing data. Because we're imputing only for use in decomposition models, we'll skip this step.
Let's impute our data with the Kalman smoothing estimate at $a_t = 0.85$.
kf = KalmanFilter(transition_matrices = 0.85, transition_covariance = 1, transition_offsets = 0,
initial_state_mean = 0, initial_state_covariance = 1,
observation_matrices = 1, observation_offsets = 0, observation_covariance = 1,
em_vars = ["transition_covariance", "observation_covariance", "initial_state_covariance"])
kf = kf.em(measurements, n_iter = 5)
smoothed_state_means, _ = kf.smooth(measurements)
smoothed_state_means = np.reshape(smoothed_state_means, (smoothed_state_means.shape[0]))
for i in imputedTwo[imputedTwo["mean"].isna()].index:
imputedTwo.at[i, "mean"] = smoothed_state_means[i]
Comparing Imputation Methods
To compare our imputation methods, we'll lean on the assumption that our data is MCAR: the existing data is then a reasonably representative sample of the population of playercounts between November and January. Therefore, whichever imputation method produces a distribution closer to that of the observed playercounts is the one we'll choose.
The Kalman filter introduced new player count means that don't exist in the sampled player counts, whereas the naive seasonal + forward fill kept the same values. So to compare the imputed distributions to the existing player counts, we'll look at their CDFs:
imputedX = np.sort(imputed["mean"])
imputedProb = np.arange(1, len(imputedX) + 1) / len(imputedX)
imputedTwoX = np.sort(imputedTwo["mean"])
imputedTwoProb = np.arange(1, len(imputedTwoX) + 1) / len(imputedTwoX)
x2X = np.sort(x2["mean"].dropna())
x2Prob = np.arange(1, len(x2X) + 1) / len(x2X)
fig, ax = plt.subplots(figsize = (9,4.5))
ax.set_xlabel("Players")
ax.set_ylabel("Cumulative probability")
ax.set_title("Comparison of imputed distributions against actual", fontsize = "large")
plt.plot(x2X, x2Prob, color = "white", label = "Actual CDF");
plt.plot(imputedX, imputedProb, color = lightBlue, alpha = 1, label = "Naive imputation CDF")
plt.plot(imputedTwoX, imputedTwoProb, color = lavender, alpha = 1, label = "Kalman + linear imputation CDF");
ax.legend();
From the graph it's clear the Kalman + linear interpolation method resulted in a distribution more similar to our existing sample of player counts. The naive imputation underestimates the player count between 5 and 30 players, which spans roughly 10% to 90% of the distribution. We could have used the Kolmogorov-Smirnov test, but it only measures the maximum vertical distance between the CDFs, which isn't informative here. Therefore, we'll use the Kalman + linear imputed data for the decomposition models!
Bird's-eye View
It's finally time for time series modeling! Thanks for making it this far :)
Let's look at the big picture: what are the general trends and daily/weekly fluctuations? We'll use decomposition models to answer this; afterwards we'll zoom in and analyze player behavior.
Trend: Classical Decomposition
The classical decomposition can help us get a sense of the general level of player activity after accounting for the daily fluctuations. Since the seasonal fluctuation doesn't vary much as the total player count increases, the additive decomposition
$$y_t = T_t + S_t + R_t$$
where $T_t$ is the core playerbase, $S_t$ is the daily player fluctuation, and $R_t$ is residual noise, is most appropriate. For the decomposition, let's use a period of 96 hours: a multiple of the daily cycle, giving a wider moving average that smooths out more of the noise:
def secondGraph(x, data, ax, label, alpha):
ax.plot(x.index, data, label=label, alpha=alpha, color=lavender)
ax.set_xticks(x.index[::168], x[::168])
ax.set_title(label, fontsize = "x-large")
imputed = imputedTwo[["date", "mean"]]
decomp = tsa.api.seasonal_decompose(imputed["mean"], period=96)
graph(imputed["date"], decomp.trend, "Trend", 1, (15, 5), secondGraph)
The trend represents the game's core player engagement after removing daily fluctuations. There are four spikes in the playercount, which make sense in the context of our game. The ones at 11/21 and 12/03 were due to new dungeons and items released around that time, driving engagement through interest in the new content. The spikes during 12/19 and 12/27 can be explained by our playerbase: most players are in school, and Christmas break falls around that time, giving them about two weeks of extra leisure time and a significantly higher core playerbase. The dip after January 3rd shows the effect of Christmas break ending as players head back to school. After that, the core playerbase slowly declined until it reached pre-break levels at the end of January.
To interpret the seasonal component, we can look at the largest value in the seasonality, which tells us the hour that players are most engaged:
decomp.seasonal[:24].idxmax()
20
8 PM, just like we saw earlier! If we solve for the residual noise term, the equation becomes $R_t = y_t - T_t - S_t$. This means we should have removed all daily player effects, or in other words, the autocorrelation between the hours of the day due to the daily player cycles. But looking at the autocorrelation function (ACF) and the partial autocorrelation function (PACF) on the residuals tells us a different story:
def thirdGraph(resid, ax, acf = True):
from matplotlib.collections import PolyCollection, LineCollection
if acf:
tsaplots.plot_acf(resid, ax = ax, color = lavender, lags = 49)
ax.set_title("Autocorrelation")
else:
tsaplots.plot_pacf(resid, ax = ax, color = lavender, lags = 49)
ax.set_title("Partial Autocorrelation")
for item in ax.collections:
if isinstance(item, PolyCollection): item.set_facecolor(lavender)
if isinstance(item, LineCollection): item.set_color(lavender)
resid = decomp.resid.dropna()
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 7))
fig.subplots_adjust(wspace=0.25)
thirdGraph(resid, ax1)
thirdGraph(resid, ax2, False)
Both of these plots should have nonsignificant spikes if our data is not autocorrelated. However, the ACF is slowly decaying while the PACF has strong spikes at lags 1, 2, 4, and 10, plus small but significant spikes around 26-40. The significant spikes around lag 26 tell us there is a daily seasonal effect that the classical decomposition was unable to remove from the residuals. This is indicative of an autoregressive and seasonal autoregressive correlation structure, which we'll need to account for when forecasting.
Seasonality: MSTL
The classical decomposition model was only able to capture the daily fluctuations in players. In order to look at daily and weekly player cycles simultaneously, we'll use the MSTL: Multiple Seasonal-Trend decomposition using LOESS. This decomposition model allows for multiple seasonal effects, and in particular, we're interested in whether a weekly fluctuation exists. The decomposition we'll be looking at is $$y_t = T_t + S^{\text{daily}}_t + S^{\text{weekly}}_t + R_t$$
where $S^{\text{daily}}_t$ is the daily fluctuation in players and $S^{\text{weekly}}_t$ is the weekly fluctuation in players.
from statsmodels.tsa.seasonal import MSTL
mstl = MSTL(imputed["mean"], periods=(24, 168), windows=(7, 13))
decomp = mstl.fit()
def mstlGraph(ax, data, title, method = 0):
ax.set_xlabel("Time")
ax.set_title(title, fontsize = "x-large")
ax.plot(data.index, data, alpha = 1, color = lavender)
ax.set_ylabel("Average players", fontsize = "small")
ax.set_xticks(data.index[::168], imputed["date"][::168], fontsize = "small")
fig, (ax1, ax2, ax3) = plt.subplots(3, 1, figsize=(14, 10))
fig.subplots_adjust(hspace = 0.75)
mstlGraph(ax1, decomp.trend, "Trend")
mstlGraph(ax2, decomp.seasonal["seasonal_24"], "Daily seasonality")
mstlGraph(ax3, decomp.seasonal["seasonal_168"], "Weekly Seasonality")
The general trend remains the same: increased players at 11/30 and 12/21 due to new content and winter break, respectively. The seasonality in this model tells us a lot about the game: the player cycles are captured and they fluctuate more when the trend reaches higher values. For example, the increased players at 11/30 and 12/21 corresponded to a much larger daily fluctuation in the player base. This implies the variance of daily cycles could be dependent on the total number of players online.
Our graphs start on a Tuesday (11/16). In the weekly seasonality graph, there is a very small but interesting weekly cycle: playercount generally trends down from Monday to Thursday, spikes on Friday (the small peak before the big one), dips on Saturday, and reaches its maximum on Sunday. This is consistent with the statistically insignificant result from the linear mixed model. However, as time goes on, the weekly seasonality shifts into a cycle with the most players on Friday, starting around 1/4. You can see the Sunday spikes are lower than the Friday spikes after that! My hypothesis is that people played enough of our game on Friday that they didn't feel as much need to play on Sunday.
After decomposing both weekly and daily cycles, the remaining error term $R_t$ still exhibits significant autocorrelation:
resid = decomp.resid
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 7))
fig.subplots_adjust(wspace=0.25)
thirdGraph(resid, ax1)
thirdGraph(resid, ax2, False)
In particular, the extreme spikes at lags 25 and 49 in the PACF graph tell us that the same hours on different days are significantly correlated with each other, the same result we saw using random effects for hours. Let's analyze the correlation structure in detail; it will inform the type of forecasting model we need to capture these autocorrelations.
Autocorrelation Structure
The decomposition models were adequate for an exploratory analysis of trends, which helped us discover some core insights, such as the weekly pattern changing over time. But they don't properly explain the correlation structure in our data. Let's inspect our original unimputed data to determine that structure.
In general, the autocorrelation at lag $k$ describes how a time point $y_t$ depends on $y_{t-k}$. The partial autocorrelation at lag $k$ removes the linear dependence on the intermediate terms and looks at the direct correlation between $y_t$ and $y_{t-k}$. This is why we analyze residuals using ACF and PACF plots: if the errors were uncorrelated, their spikes at all lags would fall within the purple (insignificant) band.
Using ACF and PACF plots requires a time series to be stationary, meaning the statistical properties of the data, such as mean and variance, don't change over time. We can deal with the mean by first centering our data:
diffed = x2.copy(deep = True)
diffed["mean"] = diffed["mean"] - diffed["mean"].mean()
Centering does not remove the trend, however, and we saw a strong trend over time, which causes the mean to change. To test for stationarity, we can use the Augmented Dickey-Fuller test. It assumes our data can be modeled by dependence on the previous time point: $$ y_t = \varphi y_{t-1} + \varepsilon_t$$ where $\varepsilon_t$ is i.i.d. normally distributed. This is an autoregressive model of order 1, denoted $AR(1)$, and $\varphi$ determines the model's root. This model accounts for dependence between nearby time points, and if you choose the right autoregressive model, the residuals $\varepsilon_t$ really will be i.i.d.
We have $H_0:$ our time series has a unit root and $H_a:$ there is no unit root.
If there was a unit root, $\varphi = 1$ and our model becomes $y_t = y_{t-1} + \varepsilon$, also known as a random walk, which is not stationary as the variance grows over time.
tsa.stattools.adfuller(diffed["mean"].dropna(), regression = "ct")[1]
0.2200362750230041
We fail to reject the null, so our data is most likely not stationary. To deal with this, we take a first difference, which is usually enough to remove the trend: the differenced points now represent the change in player count from hour to hour. The first difference alone won't make the series stationary, though, so we'll also take a seasonal difference of 24 hours.
def seasonalDiff(index, data, ax, label, alpha):
ax.plot(index, data["mean"], label = label, alpha = alpha, color = lavender)
ax.set_title(label, fontsize = "x-large")
ax.set_xticks(index[::168], data["date"][::168])
diffed["mean"] = diffed["mean"].diff(1)
graph(diffed.index, diffed, "1 hour difference", 1, (10, 2), seasonalDiff)
print("p-value for first difference:", tsa.stattools.adfuller(diffed["mean"].dropna(), regression = "ct")[1])
diffed["mean"] = diffed["mean"].diff(24)
graph(diffed.index, diffed, "1 + 24 hour difference", 1, (10, 2), seasonalDiff)
print("p-value for first difference + 24 hour difference:", tsa.stattools.adfuller(diffed["mean"].dropna(), regression = "ct")[1])
p-value for first difference: 0.13028109998337806 p-value for first difference + 24 hour difference: 0.029804525026717647
We have some evidence of a stationary series (p = 0.03); the time series looks semi-stationary as well. Let's inspect the correlation structure:
resid = diffed["mean"].dropna()
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))
fig.subplots_adjust(wspace=0.25)
thirdGraph(resid, ax1)
thirdGraph(resid, ax2, False)
The ACF is slowly decaying as time passes with no clear independent spikes. In the PACF, the spikes at lags 1, 2, and 4 fall outside the purple band and are therefore significant. Because lag 2 is significant, $y_t$ depends on $y_{t-2}$ after removing the linear effect of $y_{t-1}$. The last significant early spike (here, either 2 or 4) tells you the number of terms, or order, to include in an autoregressive model. Taking the spike at lag 2, we may have an $AR(2)$ process with $$y_t = \varphi_1y_{t-1} + \varphi_2y_{t-2} + \varepsilon$$
However, there is a group of significant spikes near 12 hours and 22 hours. If you look closely, one significant spike around 42-43 also exists, which is around two days at the same hour. This means $y_t$ is correlated with $y_{t-48}$. The spike appears before lag 24 this time because we took a one hour and a 24 hour difference, shifting the lag from 24 down to 22. In particular, the correlation between now and 48 hours ago urges us to model dependence between $y_t$ and $y_{t-24}$. All evidence throughout this analysis points towards SARIMA as a good starting point: AR terms for the short-range correlation, seasonal terms (the S) for the daily cycle, and differencing (the I) to remove the trend for a stationary series. Let's try!
Forecasting and Statistical Models
We've completed a thorough exploratory analysis on our data and discovered many insights about player behavior and trends within our game. Now it's time to compare forecasting models. Because of the 16.1% of missing data, we'll be choosing forecasting models that can deal with missing data.
We'll hold out the last 72 hours, or 3 days, in our data to see how our models perform with respect to MAE.
endog = x2["mean"][:-72]
exog = x2.index[:-72]
testEndog = x2["mean"][-72:]
testExog = x2.index[-72:]
def MAE(predicted, actual): return (predicted - actual).abs().mean()
Naive Seasonal Forecaster
Let's start with the simplest forecasting model based on daily seasonality: predicting each hour with the value from the same hour the day before. If a model can't beat this MAE, it's worse than a naive forecast.
Caveat: our holdout set is highlighted in purple. Because the naive seasonal forecast simply shifts the series by 24 hours, for all but the first day of the holdout it uses observed values from within the holdout period itself, potentially giving it a lower MAE than a true multi-day-ahead forecast.
def endOfData():
fig, ax = plt.subplots(figsize = (10, 5))
ax.set_xlabel("Time")
ax.set_ylabel("Average players")
ax.plot(x2["mean"][-300:], label = "Actual data", alpha = 0.5, color = "white")
ax.set_xticks(x2.index[-300::72], x2["date"][-300::72])
return fig, ax
_, ax = endOfData()
ax.plot(x2["mean"][-72:], alpha = 1, label = "Test set", color = lavender)
ax.legend();
MAE(x2["mean"].shift(24)[-72:], testEndog)
5.850330087087316
The average MAE for a naive seasonal forecaster is 5.85. This is the number to beat! Our forecast doesn't look too bad.
fig, ax = endOfData()
ax.plot(testExog, x2["mean"].shift(24)[-72:], label = "Naive seasonal forecast", color = lavender, alpha = 1);
ax.set_title("Forecast of naive seasonal", fontsize = "x-large")
ax.legend();
SARIMA
SARIMA is a great model to use when there is missing data, since statsmodels estimates it in state-space form and handles gaps through the Kalman filter. The ACF/PACF plots from the previous section strongly suggest an AR(2) with one hourly and one daily difference, so that's where we'll begin. In general, ARIMA first performs the hourly difference and then the daily difference, so our player count at time $t$ becomes $$z_t = (y_t - y_{t - 1}) - (y_{t-24} - y_{t - 25})$$
Since $y_t - y_{t-1}$ is the hourly change in player count at time $t$, the seasonal difference models the day-over-day change in that hourly change. These new $z_t$ are then used in the $AR(2)$ model:
$$z_t = \varphi_1 z_{t-1} + \varphi_2 z_{t-2} + \varepsilon$$
with $\varepsilon$ normal i.i.d. In general, these equations can get very complicated quickly; imagine substituting $z_{t-1}$ and $z_{t-2}$ in that equation! Let's find an appropriate SARIMA model for our data.
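As a quick check on made-up numbers (not our series), this combined difference is exactly what chained pandas diffs compute:

```python
import numpy as np
import pandas as pd

# Hypothetical hourly player counts; any series works for this identity.
rng = np.random.default_rng(4)
y = pd.Series(rng.poisson(20, size=100).astype(float))

# z_t = (y_t - y_{t-1}) - (y_{t-24} - y_{t-25}): the order-1 plus
# seasonal (lag-24) difference SARIMA applies before fitting the AR terms.
manual = (y - y.shift(1)) - (y.shift(24) - y.shift(25))
chained = y.diff(1).diff(24)

print(np.allclose(manual.dropna(), chained.dropna()))  # True
```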
Model Selection
Selecting an appropriate SARIMA model follows three general steps:
- Select the parameters:
  - Determine the order of differencing needed to fix stationarity issues. This is necessary because estimation of the AR component (assuming $|\varphi| < 1$) and the MA component usually assumes stationary data. Differencing removes the trend, allowing the AR/MA terms to properly model the data.
  - Select the AR order $p$ and MA order $q$:
    - Usually done by looking at the AICc, which scores a model on how well it fits plus a penalty for how many parameters it has. Smaller AICc is better.
    - Look at the ACF/PACF of the residuals. If any spikes are significant, the residuals are still correlated and you're missing AR/MA orders.
    - Consider comparing models with forecast accuracy metrics such as MAPE or MAE. Ideally we would use MAPE, but zeros in our test set force us to use MAE.
- Estimate the parameters. We will use `ARIMA` from statsmodels.
- Check the final model: are the residuals autocorrelated or partially autocorrelated? Is the mean/variance constant over time? Use the Ljung-Box test for autocorrelation.

These three steps are called the Box-Jenkins Method [7]. Let's start with $ARIMA(2,1,0)(0,1,0)_{24}.$
def arimaSelection(p, d, q, P, D, Q, M):
model = tsa.arima.model.ARIMA(endog, exog, order = (p, d, q), seasonal_order = (P, D, Q, M))
res = model.fit()
resid = res.resid.dropna()
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
fig.subplots_adjust(wspace=0.25)
thirdGraph(resid, ax1)
thirdGraph(resid, ax2, False)
print("MAE:", MAE(res.forecast(72, exog = testExog), testEndog))
print("AICc:", res.aicc)
return model, res
model, res = arimaSelection(2, 1, 0, 0, 1, 0, 24)
MAE: 9.195002100804366 AICc: 8492.388948801858
We're missing a few effects. The significant spike in the ACF at lag 1 indicates we need at least an $MA(1)$ component, and the significant spike in the PACF at lag 23 suggests a seasonal $AR(1)$ component.
arimaSelection(2, 1, 1, 1, 1, 0, 24);
MAE: 5.446571235358854 AICc: 7961.806249472008
modelma1, resma1 = arimaSelection(2, 1, 2, 1, 1, 1, 24);
MAE: 5.250380947909949 AICc: 7569.391392816136
modelar2, resar2 = arimaSelection(2, 1, 2, 2, 1, 0, 24);
MAE: 4.753395865331563 AICc: 7813.289441743976
We tried lots of other models. Here are our final results:
| Model | MAE | AICc |
|---|---|---|
| $ARIMA(2, 1, 0)(0, 1, 0)_{24}$ | 9.195 | 8492.389 |
| $ARIMA(2, 1, 1)(1, 1, 0)_{24}$ | 5.447 | 7961.806 |
| $ARIMA(2, 1, 2)(1, 1, 1)_{24}$ | 5.250 | 7569.391 |
| $ARIMA(2, 1, 2)(2, 1, 0)_{24}$ | 4.753 | 7813.289 |
| $ARIMA(3, 1, 2)(2, 1, 0)_{24}$ | 4.938 | 7812.998 |
| $ARIMA(2, 1, 3)(2, 1, 0)_{24}$ | 5.094 | 7806.444 |
| $ARIMA(2, 1, 2)(3, 1, 0)_{24}$ | 4.837 | 7768.589 |
| $ARIMA(2, 1, 2)(2, 1, 1)_{24}$ | 5.274 | 7570.637 |
| $ARIMA(2, 1, 2)(1, 1, 2)_{24}$ | 5.307 | 7569.973 |
| $ARIMA(3, 1, 2)(1, 1, 1)_{24}$ | 5.252 | 7571.406 |
| $ARIMA(2, 1, 3)(1, 1, 1)_{24}$ | 5.252 | 7571.407 |
| $ARIMA(3, 1, 2)(2, 1, 2)_{24}$ | 5.257 | 7574.923 |
| $ARIMA(3, 1, 2)(3, 1, 0)_{24}$ | 4.822 | 7770.38 |
In general, our best model by MAE was $ARIMA(2, 1, 2)(2, 1, 0)_{24}$ at 4.753, while the best by AICc was $ARIMA(2, 1, 2)(1, 1, 1)_{24}$ at 7569.391.
However, $ARIMA(2, 1, 2)(1, 1, 1)_{24}$ has autocorrelated residuals: 41 out of 48 lags show statistically significant autocorrelation:
(resma1.test_serial_correlation("ljungbox", df_adjust = True, lags = 48) < 0.05).sum()
41
Therefore we'll choose $ARIMA(2,1,2)(2,1,0)_{24}$ for forecasting.
Forecasting¶
Before forecasting we need to check the assumptions on the error term (residuals): normally distributed and uncorrelated.
res = resar2
s = (res.test_serial_correlation("ljungbox", df_adjust = True, lags = 48) < 0.05).sum()
print("{}/48 lags show statistically significant autocorrelation".format(s))
36/48 lags show statistically significant autocorrelation
The Ljung-Box test shows 36 lags are significantly autocorrelated. However, looking at the earlier correlograms, the significant lags don't seem to be as severe as the test claims.
We can test normality with the Jarque-Bera test; a significant p-value means the residuals are unlikely to have originated from a normal distribution.
sm.stats.stattools.jarque_bera(res.resid.dropna())[1]
0.0
A p-value of 0 shows extreme deviation from a normal distribution. We can see this in the plots of the residuals below: most of the error terms are tightly clustered around 0, far more peaked than a normal distribution would allow.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize = (10, 5));
ax1.hist(res.resid, bins = 50, width = 0.5, color = lavender);
ax2.plot(res.resid.index, res.resid, color = lavender);
plt.suptitle("Normality is violated", y = 1.01);
ax2.set_xlabel("Residuals from ARIMA");
Even though this model does not pass the Ljung-Box test for lag 48 and shows deviations from normality, in practice we would use one of the best models we could find for forecasting [8]. Most of our models failed normality tests and showed autocorrelation before the 48th lag. Let's see how it performs on the test set:
fig, ax = endOfData()
newx = np.hstack([testExog, [testExog.max() + i for i in range(28)]])
fore = res.get_forecast(100, exog = newx)
ax.plot(newx, fore.predicted_mean, label = "ARIMA(2, 1, 2)(2, 1, 0) forecast", color = lavender, alpha = 1);
ax.fill_between(fore.predicted_mean.index, y1 = fore.conf_int()["lower mean"], y2 = fore.conf_int()["upper mean"], color = lavender, alpha = 0.15)
ax.set_title("Forecast of best ARIMA model, MAE: 4.75", fontsize = "x-large")
ax.legend();
Our forecast definitely captures the daily player seasonality well, thanks to the seasonal autoregressive components. However, the real playercount was trending downwards after January 22, and this general trend was not captured by the model. The predictions are also already off near the beginning of the test set, and the 95% confidence intervals are quite wide from the start. In any case, the SARIMA model captures our data's inherent seasonal structure well; in particular, its MAE of 4.753 beats the naive seasonal forecast's 5.85.
One thing to note is that SARIMA only allows one seasonal effect, which we defined to be 24 hours. Our MSTL decomposition showed a small but persistent weekly effect, so let's use a model which can capture two seasonalities at once: unobserved components.
Unobserved Components¶
Unobserved Components is a powerful state-space model first introduced by Andrew Harvey in his book Forecasting, Structural Time Series Models and the Kalman Filter [9]. The main idea behind unobserved components is that it's useful to break a time series down into trend, seasonality, a cyclic component, autocorrelated errors, and a regression component. In general, the model assumes for each time point $t$: $$y_t = \mu_t + \gamma_t + c_t + \sum_{j=1}^k\beta_j x_{jt} + \varepsilon_{t} + \sum_{i=1}^p\varphi_i\varepsilon_{t - i} $$
where $\mu_t$ is the trend, $\gamma_t$ is the seasonal component, $c_t$ is the cyclic component, $\sum_{j=1}^k\beta_j x_{jt}$ is the regression component, $\varepsilon_t$ is the error and $\sum_{i=1}^p\varphi_i\varepsilon_{t-i}$ are the autocorrelated errors. These components are assumed to be unobserved; our data is assumed to be a single time series generated from the hidden variables. Each of these terms is time-varying; they can change across time. Also, the seasonal effects require you to specify a period, and they must sum to 0 across one full period. Estimation is performed using Kalman filtering.
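To make "estimation is performed using Kalman filtering" a bit more concrete, here's a hand-rolled filter for the simplest special case, a local level model $y_t = \mu_t + \varepsilon_t$, $\mu_t = \mu_{t-1} + \eta_t$; a sketch with made-up noise variances, not the full multi-component filter statsmodels runs under the hood:

```python
import numpy as np

def local_level_filter(y, obs_var=1.0, state_var=0.1):
    """Kalman filter for y_t = mu_t + eps_t, mu_t = mu_{t-1} + eta_t."""
    mu, P = y[0], 1.0              # initial state estimate and its variance
    filtered = []
    for obs in y:
        # Predict: the level is a random walk, so the mean carries over
        # while the state variance grows.
        P = P + state_var
        # Update: blend prediction and observation via the Kalman gain.
        K = P / (P + obs_var)
        mu = mu + K * (obs - mu)
        P = (1 - K) * P
        filtered.append(mu)
    return np.array(filtered)

# A noisy constant level of 20 players: the filter should settle near 20.
rng = np.random.default_rng(1)
y = 20 + rng.normal(scale=2.0, size=200)
est = local_level_filter(y)
```

The filtered estimate is much smoother than the raw observations, which is exactly the behavior the trend component above relies on.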
The advantage of unobserved components is each component is similar to a decomposition model, in which you can plot and analyze each component. It also comes with the benefit of being able to specify how many seasonal peaks/troughs and autocorrelated errors you wish to model, unlike decomposition models.
Model Selection¶
Our data did not include any independent variables, so we do not have a regression component. We tried many combinations of specifications for each component; here is the best we found, with exploratory analysis guiding our intuition:
For trend, we'll assume a smooth trend, which says the current average player count $\mu_t$ depends on the previous average $\mu_{t-1}$ with a hidden value $\beta_{t-1}$ added: \begin{align*} y_t &= \mu_t + \varepsilon_t \\ \mu_t &= \mu_{t-1} + \beta_{t-1} \\ \beta_{t} &= \beta_{t-1} + \zeta_t\end{align*} with $\varepsilon_t \sim N(0, \sigma_{\varepsilon}^2)$ and $\zeta_t \sim N(0, \sigma_{\zeta}^2)$. This makes sense, as the trends we saw didn't have huge fluctuations, so dependence on the previous level plus a small hidden slope adjustment $\beta_{t-1}$ fits here.
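A quick simulation of this recursion (with made-up noise scales) shows why it's called a smooth trend: the level $\mu_t$ can only bend gradually, because the noise enters through the slope $\beta_t$ rather than the level itself.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 300
mu = np.empty(n)
beta = np.empty(n)
mu[0], beta[0] = 50.0, 0.0                          # start at 50 players, flat slope
for t in range(1, n):
    mu[t] = mu[t - 1] + beta[t - 1]                 # level drifts by the previous slope
    beta[t] = beta[t - 1] + rng.normal(scale=0.05)  # only the slope receives noise
y = mu + rng.normal(scale=2.0, size=n)              # observed series = trend + noise
```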
For seasonality: in our exploratory analysis, both daily and weekly seasonalities were present. Unobserved components lets us directly specify the period of each seasonal component and the number of peaks/troughs within each period; each sine/cosine pair is called a harmonic. For the daily cycle, we'll specify 1 harmonic to capture the single peak/trough per day. For the weekly seasonality, we went with 14 harmonics: enough to capture differences in each day's effect on the playercount, with the extra harmonics left to capture smaller effects within each day. These seasonalities are allowed to change over time, which is great since we saw the daily player fluctuations increase over winter break.
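Concretely, a frequency-domain seasonal component with $h$ harmonics is a sum of $h$ sine/cosine pairs at multiples of the base frequency, and it sums to zero over one full period. A sketch for the daily case (period 24, 1 harmonic, made-up coefficients):

```python
import numpy as np

def fourier_seasonal(t, period, harmonics, coefs):
    """Seasonal pattern as a sum of sine/cosine pairs, one (a_j, b_j) per harmonic."""
    out = np.zeros_like(t, dtype=float)
    for j in range(1, harmonics + 1):
        a, b = coefs[j - 1]
        out += a * np.cos(2 * np.pi * j * t / period)
        out += b * np.sin(2 * np.pi * j * t / period)
    return out

t = np.arange(24)
gamma = fourier_seasonal(t, period=24, harmonics=1, coefs=[(5.0, 2.0)])
assert abs(gamma.sum()) < 1e-9   # sums to (numerically) zero over one period
```

More harmonics allow sharper within-period shapes, which is why 14 weekly harmonics can distinguish the individual days.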
Autoregressive errors: after accounting for trend and seasonality, we should have a stationary distribution. However, our remaining component will most likely have some autocorrelation, so we chose our current error at time $t$ to depend on two previous errors.
Cyclic component: None because our data showed clear daily/weekly seasonalities and no other prominent cycles.
model_params = {
"irregular" : True,
"freq_seasonal" : [{'period': 24, 'harmonics': 1}, {'period': 168, 'harmonics': 14}],
"level" : "strend",
"stochastic_level" : False,
"autoregressive": 2
}
model = tsa.statespace.structural.UnobservedComponents(endog, **model_params)
Before fitting the model, I set the initial hidden state values all to 1. I found this gave better results in terms of MAE on the test set.
obs = 34
model.initialize_known([1 for i in range(obs)], np.eye(obs))
res = model.fit(method = "powell", optim_complex_step = True);
Optimization terminated successfully.
Current function value: 2.262253
Iterations: 4
Function evaluations: 368
Here's a decomposition of each of the components on the training set:
def plotUC(x, y, data, name):
axs[x, y].plot(data, color = lavender)
axs[x, y].set_title(name);
axs[x, y].set_xticks(endog.index[:-72:168], x2["date"][:-72:168]);
fig, axs = plt.subplots(2, 2, figsize = (20, 8));
plotUC(0, 0, res.level["smoothed"], "Trend")
plotUC(0, 1, res.autoregressive["smoothed"], "AR errors")
plotUC(1, 0, res.freq_seasonal[0]["smoothed"], "Daily player fluctuation")
plotUC(1, 1, res.freq_seasonal[1]["smoothed"], "Weekly player fluctuation")
fig.subplots_adjust(hspace=0.3);
The estimated trend is in line with what we saw in MSTL and the classical decomposition: three to four peaks coinciding with new content releases and winter break. The daily player seasonality was accurately captured and showed a distinct increase over time. In fact, the daily player fluctuations are proportional to the average number of players online: as the average player count spiked, the daily fluctuations grew with it! We saw hints of this in the MSTL decomposition, but not to the degree unobserved components has teased it apart. The weekly player fluctuations might be a little overfit to the data; our earlier analysis showed Sunday, Friday, Wednesday and Tuesday had the largest playercounts, in that order. The largest spike here is definitely Sunday, but the Friday spike is largely missing and sits below the Tuesday and Wednesday spikes. I suspect this is due to picking a large number of harmonics for the weekly cycle in an attempt to minimize MAE.
Forecasting¶
Before forecasting, let's check the classic assumptions: uncorrelated and normally distributed residuals. We'll start with autocorrelation.
s = (res.test_serial_correlation("ljungbox", df_adjust = True, lags = 48) < 0.05).sum()
print("{}/48 lags show statistically significant autocorrelation".format(s))
9/48 lags show statistically significant autocorrelation
The autocorrelation is much better than our SARIMA model; only 9/48 lags have significant autocorrelation. However, normality is violated, just like before:
print("Jarque-Bera p-value:", res.test_normality("jarquebera")[0][1])
fig, (ax1, ax2) = plt.subplots(1, 2, figsize = (10, 5));
ax1.hist(res.resid, bins = 50, width = 0.5, color = lavender);
ax2.plot(res.resid.index, res.resid, color = lavender);
plt.suptitle("Normality is violated", y = 1.01);
ax2.set_xlabel("Residuals from Unobserved Components");
Jarque-Bera p-value: 0.0
The residuals still show a dense concentration around zero and are too tightly clustered to match a normal distribution. However, we'll continue with forecasting. This model achieves a significantly better MAE on the test set than SARIMA: 3.72.
MAE(res.forecast(72, exog = testExog), testEndog)
3.7187550691053985
fig, ax = endOfData()
newx = np.hstack([testExog, [testExog.max() + i for i in range(28)]])
fore = res.get_forecast(100, exog = newx)
ax.plot(newx, fore.predicted_mean, label = "Unobserved Components Forecast", color = lavender, alpha = 1);
ax.fill_between(fore.predicted_mean.index, y1 = fore.conf_int()["lower mean"], y2 = fore.conf_int()["upper mean"], color = lavender, alpha = 0.15)
ax.set_title("Forecast of best Unobserved Components model, MAE: 3.72", fontsize = "x-large")
ax.legend();
As for the forecast, we can already see significantly better results than SARIMA. The trend component was accurately captured in this model: the predictions slowly trend down over the three days. Moreover, the general peaks and troughs align with each day's peak and trough player count, and the prediction near the start of the test set is extremely close to the actual values, unlike SARIMA. Just like the SARIMA model, though, the 95% prediction intervals widen quickly after a few hours, already dipping below 0 players, which is impossible.
In conclusion, unobserved components is the forecast winner with an MAE of 3.72, SARIMA second with 4.75, and the naive seasonal forecast last with 5.85. Due to our 16.1% missing data, we only compared forecasting methods that can handle missing data directly, with no imputation.
Conclusion
We performed a comprehensive exploratory and forecasting analysis on real-life player data in a small online game. The core game insights we discovered were
- Strong daily player cycles and small weekly player cycles. 8 PM PST is peak time for our game.
- The average number of players and the daily player cycles are strongly affected by in-game content releases and winter break.
- No statistically significant differences in player counts between the days of the week, perhaps because the differences are small to begin with.
The core diagnostic and statistical insights I used and learned were
- Drilling down to the player level and zooming out to the general behavior to develop a deep sense of player behavior.
- Chi-square test for evidence against missing at random (MAR), implying evidence for missing completely at random (since not MNAR).
- Model based methods are good for imputing continuous missing sequences as they capture the potential patterns within missing data. Could also have used unobserved components to impute data!
- Linear mixed models only capture variation within one level of a random effect, not between levels; they're inappropriate for time series data with complex autocorrelation structure.
- ACF/PACF plots drive model selection by showing us autocorrelation structures we haven't dealt with in our data.
- Unobserved components forecasting captures trend and explainable seasonal components well; SARIMA less so, since it models a differenced, stationary series. Both are solid methods for forecasting with missing data.
For me: I drove this analysis based on our research questions and core exploratory analysis insights, which in turn motivated how we chose forecasting models. The exploratory analysis really showed me our modeling problem required daily + weekly seasonality + trend, and methods that would work well with a large percentage of missing data. Lastly, the core insights, such as 8 PM being the peak player time and winter break driving large increases in playercount, could be used to develop and target releases/in-game events to keep our players interested!
Thanks for making it to the end! I really enjoyed this analysis and learning about my playerbase :)
References
[1] Little, Roderick JA, and Donald B. Rubin. Statistical Analysis with Missing Data. John Wiley & Sons, 2019, pp 305-306.
[2] Liu, Hang, Youyuan Wang, and WeiGen Chen. Three-step imputation of missing values in condition monitoring datasets. IET Generation, Transmission & Distribution 14.16 (2020): 3288-3300.
[3] University of Wisconsin-Madison. Mixed Models: Testing Significance of Effects. Social Science Computing Cooperative.
[4] Wilks, S. S. The Large-Sample Distribution of the Likelihood Ratio for Testing Composite Hypotheses. The Annals of Mathematical Statistics, vol. 9, no. 1, 1938, pp. 60-62. JSTOR.
[5] Pinheiro, Jose C., and Douglas M. Bates. Mixed-effects models in S and S-PLUS. New York, NY: Springer New York, 2000, pp. 75-76.
[6] Wooldridge, Jeffrey M. Introductory Econometrics: A Modern Approach. South-Western Cengage Learning, 2016.
[7] Box, George EP, et al. Time Series Analysis: Forecasting and Control. John Wiley & Sons, 2015.
[8] Hyndman, Rob J, and George Athanasopoulos. Forecasting: Principles and Practice (2nd Ed). 8.9: Seasonal ARIMA Models, Apr. 2018.
[9] Harvey, Andrew C. Forecasting, Structural Time Series Models and the Kalman Filter. 1990.